# import the packages needed to read the dataset and convert it to a DataFrame
using DataFrames
using CSV
houses = CSV.File("data/newhouses.csv") |> DataFrame
20,640 rows × 10 columns (omitted printing of 4 columns)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64? | Float64 |
| 1 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 |
| 2 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 |
| 3 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 |
| 5 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 |
| 6 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 |
| 7 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 |
| 8 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 |
| 9 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 |
| 10 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 |
| 11 | -122.26 | 37.85 | 52.0 | 2202.0 | 434.0 | 910.0 |
| 12 | -122.26 | 37.85 | 52.0 | 3503.0 | 752.0 | 1504.0 |
| 13 | -122.26 | 37.85 | 52.0 | 2491.0 | 474.0 | 1098.0 |
| 14 | -122.26 | 37.84 | 52.0 | 696.0 | 191.0 | 345.0 |
| 15 | -122.26 | 37.85 | 52.0 | 2643.0 | 626.0 | 1212.0 |
| 16 | -122.26 | 37.85 | 50.0 | 1120.0 | 283.0 | 697.0 |
| 17 | -122.27 | 37.85 | 52.0 | 1966.0 | 347.0 | 793.0 |
| 18 | -122.27 | 37.85 | 52.0 | 1228.0 | 293.0 | 648.0 |
| 19 | -122.26 | 37.84 | 50.0 | 2239.0 | 455.0 | 990.0 |
| 20 | -122.27 | 37.84 | 52.0 | 1503.0 | 298.0 | 690.0 |
| 21 | -122.27 | 37.85 | 40.0 | 751.0 | 184.0 | 409.0 |
| 22 | -122.27 | 37.85 | 42.0 | 1639.0 | 367.0 | 929.0 |
| 23 | -122.27 | 37.84 | 52.0 | 2436.0 | 541.0 | 1015.0 |
| 24 | -122.27 | 37.84 | 52.0 | 1688.0 | 337.0 | 853.0 |
| 25 | -122.27 | 37.84 | 52.0 | 2224.0 | 437.0 | 1006.0 |
| 26 | -122.28 | 37.85 | 41.0 | 535.0 | 123.0 | 317.0 |
| 27 | -122.28 | 37.85 | 49.0 | 1130.0 | 244.0 | 607.0 |
| 28 | -122.28 | 37.85 | 52.0 | 1898.0 | 421.0 | 1102.0 |
| 29 | -122.28 | 37.84 | 50.0 | 2082.0 | 492.0 | 1131.0 |
| 30 | -122.28 | 37.84 | 52.0 | 729.0 | 160.0 | 395.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
names(houses) # show the column names of the DataFrame
10-element Vector{String}:
"longitude"
"latitude"
"housing_median_age"
"total_rooms"
"total_bedrooms"
"population"
"households"
"median_income"
"median_house_value"
"ocean_proximity"
VegaLite will be used to plot the data on a map.
] add JSON
Updating registry at `C:\Users\pmore\.julia\registries\General.toml`
Resolving package versions...
Updating `C:\Users\pmore\.julia\environments\v1.8\Project.toml`
[682c06a0] + JSON v0.21.3
No Changes to `C:\Users\pmore\.julia\environments\v1.8\Manifest.toml`
] add VegaLite
Resolving package versions...
Installed JSONSchema ─ v1.0.1
Installed NodeJS ───── v1.3.0
Installed FilePaths ── v0.8.3
Installed Vega ─────── v2.3.1
Installed VegaLite ─── v2.6.0
Updating `C:\Users\pmore\.julia\environments\v1.8\Project.toml`
[112f6efa] + VegaLite v2.6.0
Updating `C:\Users\pmore\.julia\environments\v1.8\Manifest.toml`
[8fc22ac5] + FilePaths v0.8.3
[7d188eb4] + JSONSchema v1.0.1
[2bd173c7] + NodeJS v1.3.0
[239c3e63] + Vega v2.3.1
[112f6efa] + VegaLite v2.6.0
Precompiling project...
✓ NodeJS
✓ FilePaths
✓ JSONSchema
✓ Vega
✓ VegaLite
5 dependencies successfully precompiled in 9 seconds. 279 already precompiled.
] add VegaDatasets
Resolving package versions...
Installed TableShowUtils ─────── v0.2.5
Installed Nullables ──────────── v1.0.0
Installed Polynomials ────────── v3.2.1
Installed IterableTables ─────── v1.0.0
Installed TextParse ──────────── v1.0.2
Installed Quadmath ───────────── v0.5.6
Installed DoubleFloats ───────── v1.2.2
Installed GenericLinearAlgebra ─ v0.3.5
Installed VegaDatasets ───────── v2.1.1
Updating `C:\Users\pmore\.julia\environments\v1.8\Project.toml`
[0ae4a718] + VegaDatasets v2.1.1
Updating `C:\Users\pmore\.julia\environments\v1.8\Manifest.toml`
[497a8b3b] + DoubleFloats v1.2.2
[14197337] + GenericLinearAlgebra v0.3.5
[1c8ee90f] + IterableTables v1.0.0
[4d1e1d77] + Nullables v1.0.0
[f27b6e38] + Polynomials v3.2.1
[be4d8f0f] + Quadmath v0.5.6
[5e66a065] + TableShowUtils v0.2.5
[e0df1984] + TextParse v1.0.2
[0ae4a718] + VegaDatasets v2.1.1
Precompiling project...
✓ Nullables
✓ Quadmath
✓ GenericLinearAlgebra
✓ TableShowUtils
✓ IterableTables
✓ Polynomials
✓ DoubleFloats
✓ TextParse
✓ VegaDatasets
9 dependencies successfully precompiled in 11 seconds. 284 already precompiled.
using JSON
using VegaLite
using VegaDatasets
cali_shape = JSON.parsefile("data/california-counties.json")
VV = VegaDatasets.VegaJSONDataset(cali_shape,"data/california-counties.json")
Vega JSON Dataset
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="median_house_value:q"
)
Let's split the dataset into groups by price. Each group covers a band of 100,000: a house's group is the integer quotient of its `median_house_value` divided by 100,000.
bucketprice = Int.(div.(houses[!,:median_house_value],100000))
20640-element Vector{Int64}:
4
3
3
3
3
2
2
2
2
2
2
2
2
⋮
0
1
1
1
1
0
1
0
0
0
0
0
extrema(bucketprice)
(0, 5)
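The bucketing rule above can be checked on a handful of values. This is a minimal sketch using only Base Julia; the prices in it are made up for illustration and are not rows from the dataset.

```julia
# Hypothetical prices, not taken from the dataset
prices = [452600.0, 358500.0, 99999.0, 100000.0, 500001.0]

# div performs floored division here because all prices are positive;
# Int.() converts the resulting whole-number floats to integers
buckets = Int.(div.(prices, 100000))

println(buckets)  # prints [4, 3, 0, 1, 5]
```

A price just under 100,000 lands in bucket 0 and a price just over 500,000 lands in bucket 5, which matches the `(0, 5)` range reported by `extrema`.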
insertcols!(houses, 3, :cprice => bucketprice) # insert the price bucket as a new third column of the DataFrame
20,640 rows × 11 columns (omitted printing of 4 columns)
| | longitude | latitude | cprice | housing_median_age | total_rooms | total_bedrooms | population |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Int64 | Float64 | Float64 | Float64? | Float64 |
| 1 | -122.23 | 37.88 | 4 | 41.0 | 880.0 | 129.0 | 322.0 |
| 2 | -122.22 | 37.86 | 3 | 21.0 | 7099.0 | 1106.0 | 2401.0 |
| 3 | -122.24 | 37.85 | 3 | 52.0 | 1467.0 | 190.0 | 496.0 |
| 4 | -122.25 | 37.85 | 3 | 52.0 | 1274.0 | 235.0 | 558.0 |
| 5 | -122.25 | 37.85 | 3 | 52.0 | 1627.0 | 280.0 | 565.0 |
| 6 | -122.25 | 37.85 | 2 | 52.0 | 919.0 | 213.0 | 413.0 |
| 7 | -122.25 | 37.84 | 2 | 52.0 | 2535.0 | 489.0 | 1094.0 |
| 8 | -122.25 | 37.84 | 2 | 52.0 | 3104.0 | 687.0 | 1157.0 |
| 9 | -122.26 | 37.84 | 2 | 42.0 | 2555.0 | 665.0 | 1206.0 |
| 10 | -122.25 | 37.84 | 2 | 52.0 | 3549.0 | 707.0 | 1551.0 |
| 11 | -122.26 | 37.85 | 2 | 52.0 | 2202.0 | 434.0 | 910.0 |
| 12 | -122.26 | 37.85 | 2 | 52.0 | 3503.0 | 752.0 | 1504.0 |
| 13 | -122.26 | 37.85 | 2 | 52.0 | 2491.0 | 474.0 | 1098.0 |
| 14 | -122.26 | 37.84 | 1 | 52.0 | 696.0 | 191.0 | 345.0 |
| 15 | -122.26 | 37.85 | 1 | 52.0 | 2643.0 | 626.0 | 1212.0 |
| 16 | -122.26 | 37.85 | 1 | 50.0 | 1120.0 | 283.0 | 697.0 |
| 17 | -122.27 | 37.85 | 1 | 52.0 | 1966.0 | 347.0 | 793.0 |
| 18 | -122.27 | 37.85 | 1 | 52.0 | 1228.0 | 293.0 | 648.0 |
| 19 | -122.26 | 37.84 | 1 | 50.0 | 2239.0 | 455.0 | 990.0 |
| 20 | -122.27 | 37.84 | 1 | 52.0 | 1503.0 | 298.0 | 690.0 |
| 21 | -122.27 | 37.85 | 1 | 40.0 | 751.0 | 184.0 | 409.0 |
| 22 | -122.27 | 37.85 | 1 | 42.0 | 1639.0 | 367.0 | 929.0 |
| 23 | -122.27 | 37.84 | 1 | 52.0 | 2436.0 | 541.0 | 1015.0 |
| 24 | -122.27 | 37.84 | 0 | 52.0 | 1688.0 | 337.0 | 853.0 |
| 25 | -122.27 | 37.84 | 1 | 52.0 | 2224.0 | 437.0 | 1006.0 |
| 26 | -122.28 | 37.85 | 1 | 41.0 | 535.0 | 123.0 | 317.0 |
| 27 | -122.28 | 37.85 | 0 | 49.0 | 1130.0 | 244.0 | 607.0 |
| 28 | -122.28 | 37.85 | 1 | 52.0 | 1898.0 | 421.0 | 1102.0 |
| 29 | -122.28 | 37.84 | 1 | 50.0 | 2082.0 | 492.0 | 1131.0 |
| 30 | -122.28 | 37.84 | 1 | 52.0 | 729.0 | 160.0 | 395.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="cprice:n" # color by the price bucket as a nominal field (the first map used median_house_value:q)
)
] add Clustering
Resolving package versions...
Installed NearestNeighbors ─ v0.4.13
Installed Distances ──────── v0.10.7
Installed Clustering ─────── v0.14.3
Updating `C:\Users\pmore\.julia\environments\v1.8\Project.toml`
[aaaa29a8] + Clustering v0.14.3
Updating `C:\Users\pmore\.julia\environments\v1.8\Manifest.toml`
[aaaa29a8] + Clustering v0.14.3
[b4f34e82] + Distances v0.10.7
[b8a86587] + NearestNeighbors v0.4.13
Precompiling project...
✓ Distances
✓ NearestNeighbors
✓ Clustering
3 dependencies successfully precompiled in 6 seconds. 293 already precompiled.
using Clustering
houses = dropmissing(houses) # drop the rows that contain missing values
20,433 rows × 11 columns (omitted printing of 4 columns)
| | longitude | latitude | cprice | housing_median_age | total_rooms | total_bedrooms | population |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | Float64 |
| 1 | -122.23 | 37.88 | 4 | 41.0 | 880.0 | 129.0 | 322.0 |
| 2 | -122.22 | 37.86 | 3 | 21.0 | 7099.0 | 1106.0 | 2401.0 |
| 3 | -122.24 | 37.85 | 3 | 52.0 | 1467.0 | 190.0 | 496.0 |
| 4 | -122.25 | 37.85 | 3 | 52.0 | 1274.0 | 235.0 | 558.0 |
| 5 | -122.25 | 37.85 | 3 | 52.0 | 1627.0 | 280.0 | 565.0 |
| 6 | -122.25 | 37.85 | 2 | 52.0 | 919.0 | 213.0 | 413.0 |
| 7 | -122.25 | 37.84 | 2 | 52.0 | 2535.0 | 489.0 | 1094.0 |
| 8 | -122.25 | 37.84 | 2 | 52.0 | 3104.0 | 687.0 | 1157.0 |
| 9 | -122.26 | 37.84 | 2 | 42.0 | 2555.0 | 665.0 | 1206.0 |
| 10 | -122.25 | 37.84 | 2 | 52.0 | 3549.0 | 707.0 | 1551.0 |
| 11 | -122.26 | 37.85 | 2 | 52.0 | 2202.0 | 434.0 | 910.0 |
| 12 | -122.26 | 37.85 | 2 | 52.0 | 3503.0 | 752.0 | 1504.0 |
| 13 | -122.26 | 37.85 | 2 | 52.0 | 2491.0 | 474.0 | 1098.0 |
| 14 | -122.26 | 37.84 | 1 | 52.0 | 696.0 | 191.0 | 345.0 |
| 15 | -122.26 | 37.85 | 1 | 52.0 | 2643.0 | 626.0 | 1212.0 |
| 16 | -122.26 | 37.85 | 1 | 50.0 | 1120.0 | 283.0 | 697.0 |
| 17 | -122.27 | 37.85 | 1 | 52.0 | 1966.0 | 347.0 | 793.0 |
| 18 | -122.27 | 37.85 | 1 | 52.0 | 1228.0 | 293.0 | 648.0 |
| 19 | -122.26 | 37.84 | 1 | 50.0 | 2239.0 | 455.0 | 990.0 |
| 20 | -122.27 | 37.84 | 1 | 52.0 | 1503.0 | 298.0 | 690.0 |
| 21 | -122.27 | 37.85 | 1 | 40.0 | 751.0 | 184.0 | 409.0 |
| 22 | -122.27 | 37.85 | 1 | 42.0 | 1639.0 | 367.0 | 929.0 |
| 23 | -122.27 | 37.84 | 1 | 52.0 | 2436.0 | 541.0 | 1015.0 |
| 24 | -122.27 | 37.84 | 0 | 52.0 | 1688.0 | 337.0 | 853.0 |
| 25 | -122.27 | 37.84 | 1 | 52.0 | 2224.0 | 437.0 | 1006.0 |
| 26 | -122.28 | 37.85 | 1 | 41.0 | 535.0 | 123.0 | 317.0 |
| 27 | -122.28 | 37.85 | 0 | 49.0 | 1130.0 | 244.0 | 607.0 |
| 28 | -122.28 | 37.85 | 1 | 52.0 | 1898.0 | 421.0 | 1102.0 |
| 29 | -122.28 | 37.84 | 1 | 50.0 | 2082.0 | 492.0 | 1131.0 |
| 30 | -122.28 | 37.84 | 1 | 52.0 | 729.0 | 160.0 | 395.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
X = houses[!, [:median_house_value]]; # select the column without copying its data
size(X) # X is a one-column DataFrame (rows × 1), not a vector
(20433, 1)
C = kmeans(Matrix(X)', 5) # kmeans expects a features × observations matrix (one observation per column), hence the transpose
KmeansResult{Matrix{Float64}, Float64, Int64}([90087.84070484582 245057.80102381483 … 346571.8560277537 164580.78314200346], [3, 4, 4, 4, 4, 2, 4, 2, 2, 2 … 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [7.785306024210205e8, 1.4228061862265015e8, 3.0560375777893066e7, 2.7792465977386475e7, 1.911312512741089e7, 6.072379703819122e8, 2.244092743514221e9, 1.3379508329818726e7, 3.370088584299774e8, 2.573521479915161e8 … 4.8014272497621155e8, 2.92825995742733e8, 6.508702719013214e8, 6.74395602889862e7, 7.135394542096939e8, 1.4370832476475906e8, 1.6868400617444992e8, 4.89364874713707e6, 2.902882746079445e7, 473124.8352432251], [5675, 4493, 1600, 2306, 6359], [5675, 4493, 1600, 2306, 6359], 1.2139304361478732e13, 20, true)
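To make the centers and assignments in this result easier to read, here is a minimal sketch of the two alternating steps behind k-means (Lloyd's iterations) on one-dimensional data. It is an illustration only, not the implementation used by Clustering.jl; the function name `lloyd1d`, the prices and the starting centers are all hypothetical.

```julia
using Statistics  # standard library, provides mean

# Minimal 1-D Lloyd's iteration (illustrative; not Clustering.jl's implementation)
function lloyd1d(x::Vector{Float64}, centers::Vector{Float64}; iters::Int = 10)
    centers = copy(centers)              # do not mutate the caller's vector
    assignments = zeros(Int, length(x))
    for _ in 1:iters
        # Assignment step: each point goes to its nearest center
        for (i, xi) in enumerate(x)
            assignments[i] = argmin(abs.(centers .- xi))
        end
        # Update step: each center moves to the mean of its members
        for k in eachindex(centers)
            members = x[assignments .== k]
            isempty(members) || (centers[k] = mean(members))
        end
    end
    return centers, assignments
end

# Hypothetical prices and starting centers, not taken from the dataset
prices = [90_000.0, 95_000.0, 240_000.0, 250_000.0, 480_000.0, 500_000.0]
centers, assignments = lloyd1d(prices, [100_000.0, 200_000.0, 400_000.0])

println(centers)      # prints [92500.0, 245000.0, 490000.0]
println(assignments)  # prints [1, 1, 2, 2, 3, 3]
```

Unlike the fixed 100,000 bands, the group boundaries here follow the data, and the cluster numbers carry no order: in the `kmeans` output above, cluster 1 has the lowest center while cluster 3 has a higher center than cluster 2.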
insertcols!(houses, 3, :cluster_k => C.assignments) # insert the cluster assignments as a new third column
20,433 rows × 12 columns (omitted printing of 5 columns)
| | longitude | latitude | cluster_k | cprice | housing_median_age | total_rooms | total_bedrooms |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Int64 | Int64 | Float64 | Float64 | Float64 |
| 1 | -122.23 | 37.88 | 3 | 4 | 41.0 | 880.0 | 129.0 |
| 2 | -122.22 | 37.86 | 4 | 3 | 21.0 | 7099.0 | 1106.0 |
| 3 | -122.24 | 37.85 | 4 | 3 | 52.0 | 1467.0 | 190.0 |
| 4 | -122.25 | 37.85 | 4 | 3 | 52.0 | 1274.0 | 235.0 |
| 5 | -122.25 | 37.85 | 4 | 3 | 52.0 | 1627.0 | 280.0 |
| 6 | -122.25 | 37.85 | 2 | 2 | 52.0 | 919.0 | 213.0 |
| 7 | -122.25 | 37.84 | 4 | 2 | 52.0 | 2535.0 | 489.0 |
| 8 | -122.25 | 37.84 | 2 | 2 | 52.0 | 3104.0 | 687.0 |
| 9 | -122.26 | 37.84 | 2 | 2 | 42.0 | 2555.0 | 665.0 |
| 10 | -122.25 | 37.84 | 2 | 2 | 52.0 | 3549.0 | 707.0 |
| 11 | -122.26 | 37.85 | 2 | 2 | 52.0 | 2202.0 | 434.0 |
| 12 | -122.26 | 37.85 | 2 | 2 | 52.0 | 3503.0 | 752.0 |
| 13 | -122.26 | 37.85 | 2 | 2 | 52.0 | 2491.0 | 474.0 |
| 14 | -122.26 | 37.84 | 5 | 1 | 52.0 | 696.0 | 191.0 |
| 15 | -122.26 | 37.85 | 5 | 1 | 52.0 | 2643.0 | 626.0 |
| 16 | -122.26 | 37.85 | 5 | 1 | 50.0 | 1120.0 | 283.0 |
| 17 | -122.27 | 37.85 | 5 | 1 | 52.0 | 1966.0 | 347.0 |
| 18 | -122.27 | 37.85 | 5 | 1 | 52.0 | 1228.0 | 293.0 |
| 19 | -122.26 | 37.84 | 5 | 1 | 50.0 | 2239.0 | 455.0 |
| 20 | -122.27 | 37.84 | 5 | 1 | 52.0 | 1503.0 | 298.0 |
| 21 | -122.27 | 37.85 | 5 | 1 | 40.0 | 751.0 | 184.0 |
| 22 | -122.27 | 37.85 | 5 | 1 | 42.0 | 1639.0 | 367.0 |
| 23 | -122.27 | 37.84 | 1 | 1 | 52.0 | 2436.0 | 541.0 |
| 24 | -122.27 | 37.84 | 1 | 0 | 52.0 | 1688.0 | 337.0 |
| 25 | -122.27 | 37.84 | 5 | 1 | 52.0 | 2224.0 | 437.0 |
| 26 | -122.28 | 37.85 | 1 | 1 | 41.0 | 535.0 | 123.0 |
| 27 | -122.28 | 37.85 | 1 | 0 | 49.0 | 1130.0 | 244.0 |
| 28 | -122.28 | 37.85 | 1 | 1 | 52.0 | 1898.0 | 421.0 |
| 29 | -122.28 | 37.84 | 1 | 1 | 50.0 | 2082.0 | 492.0 |
| 30 | -122.28 | 37.84 | 5 | 1 | 52.0 | 729.0 | 160.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
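With both `cprice` and `cluster_k` in the table, one way to compare the two groupings numerically is a small cross-tabulation. The sketch below uses only Base Julia and hypothetical label vectors; in the notebook the inputs would presumably be `houses.cprice` and `houses.cluster_k`.

```julia
# Hypothetical labels for eight houses, not taken from the dataset
cprice    = [4, 3, 3, 2, 2, 1, 1, 0]
cluster_k = [3, 4, 4, 2, 2, 5, 5, 1]

# Count how often each (bucket, cluster) pair occurs
counts = Dict{Tuple{Int,Int},Int}()
for pair in zip(cprice, cluster_k)
    counts[pair] = get(counts, pair, 0) + 1
end

# Print the pairs in sorted order
for (pair, n) in sort(collect(counts))
    println("bucket ", pair[1], " / cluster ", pair[2], ": ", n)
end
```

If every bucket maps to a single cluster, the two groupings agree up to relabeling; buckets spread over several clusters show where k-means drew its boundaries differently.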
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="cluster_k:n"
)